ABSTRACT


This paper studies the air quality data of Beijing from 2018 to 2020. Based on
a correlation analysis of pollutant concentrations, a recurrent
neural network model using the LSTM algorithm is built to
predict the AQI of Beijing. The results show that the AQI is highly
correlated with PM2.5 and PM10, but has only a weak negative correlation
with O3. The recurrent neural network model shows high
prediction accuracy. This research helps to promote the
application of recurrent neural network models to air quality data and other
time series data.


How to cite this paper: Zeng Guojing | Jin Renhao "Predicting Beijing Air
Quality Data Based on LSTM Method" Published in International Journal of
Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470,
Volume-5 |

IJTSRD40000








KEYWORDS: AQI; LSTM; Python; Keras; Pearson correlation 


1. Research background 

With the continuous growth of the economy and of urban
scale, China's development has entered a new era, and the
public has set higher requirements for urban air
quality. After the sandstorm in March 2021, air quality
once again became a focus of attention for Beijing
residents. Monitoring and predicting air quality is of
great practical significance for improving air
quality and the level of urban environmental construction.


To better monitor and predict air quality, the
national environmental protection department began in 2012 to
use the air quality index (AQI) to describe air
quality quantitatively. The AQI [1] is a conceptual index
that condenses the concentrations of several routinely
monitored air pollutants into a single value and
represents the degree of air pollution and the air quality
status by classification. It is suitable for representing the
short-term air quality status and trend of cities.
With the development of data mining, more and more
machine learning models are being applied to the prediction of
air quality. Bai Heming [2] used a BP neural network to
forecast the AQI for different seasons in Beijing. By
comparing the forecast and monitored values of
different seasons, the study concluded that the forecast
accuracy for autumn is the highest. Li Jinglu and Zeng
Tian [3] used principal component analysis to
study the air quality data of Beijing from 2000 to 2011
and concluded that per capita GDP and the output
value of the tertiary industry had the greatest correlation
with air quality. Wang Mingjie and He Jiajia [4] used
mathematical statistics and typical circulation
classification to study the AQI. The results showed
that the main pollutants causing weather pollution were
NO2, particulate matter (PM), and O3. Li Ping and Ni Zhiwei [5]
built a fractal manifold learning support vector machine to
predict the AQI. They applied the fractal dimension method
first and then reduced the dimensionality, which
improved the accuracy and stability of prediction. Xu Qi
and Wu Qizhong [6] used the comprehensive scoring
method to monitor and forecast the PM2.5 concentration in
the air. Based on the WRF-CMAQ model system, their
evaluation results showed that the accuracy was better
than that of the official forecast.


However, the air pollution index is typical time series
data. When traditional statistical models and
ordinary neural networks are used for prediction, the accuracy
is not high enough and the computation time is long.
A recurrent neural network is a kind of neural network
model that takes sequence data as input, which makes it more
suitable for modeling and predicting time series
data. LSTM addresses the vanishing gradient and
exploding gradient problems common in traditional
recurrent neural networks. It is a widely used recurrent neural
network algorithm and has many successful
applications [7-10] in predicting time series data. At
present, however, research on applying recurrent
neural network models based on the LSTM algorithm to air
quality prediction is still lacking, especially for Beijing data.
Therefore, this paper uses the Python deep learning library
Keras to build an LSTM recurrent neural network model to
predict Beijing air quality data, and
selects the AQI, the main index of air quality, as the
prediction target variable.

2. Theoretical basis

2.1. Keras 

Keras is a powerful high-level neural network API written
in Python. It can run on top of TensorFlow, Theano or CNTK as
its computational back end. Keras is one of the
commonly used machine learning tools and has four
advantages: user-friendliness, modularity, strong
extensibility, and close integration with Python. It contains
a large number of functions, optimizers and
other components. The optimizers included in Keras
apply adaptive gradient descent updates to the gradients
obtained by backpropagation, which is convenient for
implementing the LSTM recurrent neural network
algorithm.
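
As an illustration of how Keras ties a model to an optimizer, the following minimal sketch builds a small LSTM regressor and compiles it with Adam, one of the adaptive gradient descent optimizers shipped with Keras. The 100 hidden units and the single output follow the configuration reported in Section 4. The function name `build_lstm_model`, the window length of one time step, the seven input features (the AQI plus six pollutants), the choice of Adam and the mean absolute error loss are assumptions made for illustration; the paper does not state them. The sketch also assumes that the Keras bundled with TensorFlow is installed.

```python
def build_lstm_model(n_timesteps=1, n_features=7, n_hidden=100):
    """Build and compile a small LSTM regressor (illustrative sketch only)."""
    # Imported lazily so that defining the function does not require TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_timesteps, n_features)),  # (time steps, features per step)
        keras.layers.LSTM(n_hidden),                   # recurrent hidden layer
        keras.layers.Dense(1),                         # single regression output
    ])
    # Adam is an adaptive gradient descent optimizer; the gradients it consumes
    # are computed by backpropagation when model.fit(...) is called.
    # The loss function is an assumption, not taken from the paper.
    model.compile(optimizer="adam", loss="mae")
    return model
```

Calling `build_lstm_model()` returns a compiled model that can then be trained with `model.fit(X, y)`, where `X` has shape `(samples, n_timesteps, n_features)`.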


2.2. Principle of LSTM neural network 

The long short-term memory (LSTM) network is a
variant of the recurrent neural network (RNN).
Trained with backpropagation through time, it addresses the
vanishing gradient and exploding gradient
problems of ordinary recurrent neural networks. It is widely
used in image and video recognition, stock price trend
prediction, disease prediction and other fields. The LSTM
algorithm replaces the conventional
neurons of an RNN with memory cells. Memory cells are more flexible
components than neurons, and they introduce
memory modules. Each memory cell is composed of a forget
gate, an input gate and an output gate, and its structure is shown
in Figure 1. In Fig 1, t denotes a specific time step; x_t,
x_{t-1} and x_{t+1} denote the inputs at times t, t-1
and t+1 respectively; h_t, h_{t-1} and h_{t+1}
denote the outputs of the memory cell at times t, t-1
and t+1 respectively. tanh is the
hyperbolic tangent function and σ is the sigmoid
activation function. The sigmoid maps its input
smoothly to a value between 0 and 1, so that a small
change in the input produces only a small change in the
output.
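
The gate structure described above can be sketched as a single forward step of a standard LSTM cell. This is the common textbook formulation written in plain Python, not code from the paper; the parameter layout and the names `lstm_step`, `affine` and `params` are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W @ x + U @ h + b using plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell.

    params maps each of "f", "i", "o", "g" to a (W, U, b) triple.
    Returns the new output h_t and the new cell state c_t.
    """
    f = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x_t, h_prev)]  # candidate memory
    # New cell state: keep part of the old memory and add part of the candidate.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New output: the output gate filters the squashed cell state.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t
```

Because the cell state is updated additively and scaled by the forget gate, gradients can flow across many time steps without repeatedly passing through a squashing nonlinearity, which is the usual explanation for why LSTM eases the vanishing gradient problem.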


3. Construction of LSTM prediction model 

3.1. Data sources

This paper is based on the air quality data of Beijing from
January 2018 to December 2020, taken from the
website of China Weather Post
(http://www.tianqihoubao.com/). A total of 1096 rows of
observations were obtained. The data include the
daily AQI and the concentrations of six pollutants, CO,
NO2, PM2.5, SO2, O3 and PM10, in Beijing. Because of the long
sampling period, force majeure and other factors, the data for
some dates are missing. This paper fills the missing values
of these seven indicators with their monthly means.
The trends of the AQI and the six
pollutant concentrations are shown in Figure 2.
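
The monthly-mean imputation described above could look like the following sketch. The record layout (a list of `(date, value)` pairs with `None` marking a missing reading), the function name `fill_with_monthly_mean`, and the choice to average within the same month of the same year are assumptions for illustration; the paper does not describe its implementation.

```python
from collections import defaultdict
from datetime import date


def fill_with_monthly_mean(records):
    """Replace None values with the mean of the observed values in the same month.

    records: list of (datetime.date, float or None) pairs for one indicator.
    Returns a new list; the input is left unchanged.
    """
    # Collect the observed (non-missing) values for each (year, month).
    observed = defaultdict(list)
    for day, value in records:
        if value is not None:
            observed[(day.year, day.month)].append(value)

    filled = []
    for day, value in records:
        if value is None:
            month_values = observed.get((day.year, day.month))
            if not month_values:
                raise ValueError(f"no observations available for {day:%Y-%m}")
            value = sum(month_values) / len(month_values)
        filled.append((day, value))
    return filled
```

The same function would be applied once per indicator, that is, to the AQI series and to each of the six pollutant series.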





Fig 1 The structure of LSTM 
Fig 2 Variation trend of AQI index and six pollutants





@IJTSRD | Unique Paper ID-IJTSRD40000 | Volume-5 | Issue-3 | March-April 2021 | Page 775


International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 


3.2. Correlation analysis between AQI index and pollutants 

It can be seen from Fig 2 that the AQI and the concentrations of CO, NO2, PM2.5, SO2 and PM10 in Beijing
follow roughly the same trend: when the AQI rises, these five pollutants also rise, and when the AQI
falls, they also fall. There is therefore a positive correlation between the
AQI and the concentrations of CO, NO2, PM2.5, SO2 and PM10. However, when the AQI rises, the
concentration of O3 tends to fall, so there is a negative correlation between the AQI and the O3 concentration. To
further analyze the relationship between the AQI and CO, NO2, PM2.5, SO2, O3 and PM10, the Pearson correlation coefficients
of these indicators are shown in Table 1. The AQI is positively correlated with the concentrations of CO,
NO2, PM2.5, SO2 and PM10, and weakly negatively correlated with the O3 concentration, with a coefficient
of -0.08. PM2.5 and PM10 have the highest positive correlations with the AQI, with coefficients of
0.936 and 0.785 respectively. Therefore, air pollution control policy in Beijing can focus on
controlling the emission of these two pollutants and on measures to reduce their
concentrations.


Table 1. Correlation coefficient matrix of AQI index and six pollutants in Beijing

       | AQI    | PM2.5  | SO2    | NO2    | PM10   | O3     | CO
AQI    | 1      | 0.936  | 0.438  | 0.580  | 0.785  | -0.080 | 0.757
PM2.5  | 0.936  | 1      | 0.492  | 0.659  | 0.624  | -0.043 | 0.857
SO2    | 0.438  | 0.492  | 1      | 0.619  | 0.413  | -0.258 | 0.624
NO2    | 0.580  | 0.659  | 0.619  | 1      | 0.503  | -0.453 | 0.718
PM10   | 0.785  | 0.624  | 0.413  | 0.503  | 1      | -0.003 | 0.474
O3     | -0.080 | -0.043 | -0.258 | -0.453 | -0.003 | 1      | 0.474
CO     | 0.757  | 0.857  | 0.624  | 0.718  | 0.464  | -0.172 | 1
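
For reference, each entry of Table 1 is a Pearson correlation coefficient, that is, the covariance of two series divided by the product of their standard deviations. A minimal sketch of the computation in plain Python follows. The function names `pearson_r` and `correlation_matrix` are illustrative, and the code is not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations; the common 1/n factor
    # of covariance and variance cancels in the ratio, so it is omitted.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}
```

Given a dict such as `{"AQI": [...], "PM2.5": [...], ...}` holding the daily series, `correlation_matrix` returns a nested dict laid out like Table 1.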

4. Research on AQI prediction 

Following the correlation analysis of the AQI and the six common pollutants, the air quality of the next day can
be predicted from the historical values of these pollutant concentrations. This paper builds a model for the AQI, which
is the main index for measuring air quality. The next day's AQI is used as the prediction target variable, and the
historical values of the AQI and the six pollutants are used as the model input variables. The LSTM neural network
is implemented with the Keras module in Python. Because the indicators differ in scale,
this paper applies min-max normalization to each of them. In the LSTM
model, there are 100 neurons in the hidden layer and one neuron in the output layer; the first 70% of the sample data
is used as training data and the last 30% as test data. Finally, the predicted results are denormalized before they
are compared with the real values. The fitted curves of the predicted
and real values on the training set and the test set are shown in Figure 3. It can be seen from the figure that the
prediction error of the LSTM model is small on both the training set and the test set, indicating that the model has high prediction
accuracy. The mean absolute errors of the model on the training set and the test set are 3.31 and 5.17 respectively, and
the mean absolute error rates on the training set and the test set are 4.13% and 4.91% respectively, which further shows
that the model has high prediction accuracy. In Figure 3, green represents the training set and red represents the test set.

Fig 3 Prediction effect of LSTM model on training set and test set
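
The data preparation and evaluation steps of this section (min-max normalization, next-day targets, the chronological 70/30 split, denormalization and the two error measures) can be sketched in plain Python as follows. All function names are hypothetical. Reading the "mean absolute error rate" as the mean absolute percentage error is an assumption, since the paper does not define it.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to scale a series into [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot scale a constant series")
    return lo, hi


def scale(values, lo, hi):
    """Min-max normalization: maps lo to 0 and hi to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Inverse of scale(); returns values in the original units."""
    return [v * (hi - lo) + lo for v in values]


def make_next_day_pairs(features, target):
    """Pair the feature row of day t with the target value of day t+1."""
    return features[:-1], target[1:]


def chronological_split(rows, train_fraction=0.7):
    """Use the first train_fraction of the rows for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| in the original units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |error| / |actual| in percent; actual values must be non-zero."""
    ratios = (abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * sum(ratios) / len(actual)
```

In a full pipeline the scaling bounds would typically be fitted on the training portion only, so that no information from the test period leaks into training; the paper does not say which approach it took. The errors are computed after `unscale`, so they are expressed in AQI units.
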

5. Conclusion

Based on the concentrations of air pollutants
in Beijing from January 2018 to December 2020, this
paper analyzes the AQI, the concentration
trends of six pollutants and the correlations among these
indicators. The results show that there is a positive
correlation between the AQI and the concentrations of CO,
NO2, PM2.5, SO2 and PM10, and a negative correlation
between the AQI and O3. Because of the nonlinear relationship
between the AQI and the concentrations of these
pollutants, traditional statistical prediction methods
cannot achieve ideal prediction accuracy. In this paper,
a recurrent neural network based on the long short-term
memory (LSTM) algorithm is used to establish
the prediction model. The
results show that the model has high prediction accuracy,
which suggests that recurrent neural networks can be
widely used for air quality data prediction and
can also be extended to other time series data.